    Towards End-to-end Non-autoregressive speech applications

    Sequence-to-sequence labeling is one of the most challenging problems studied in the speech research community. Speech is a complicated signal, and many tasks aim to assign a sequence of labels to it. The traditional hybrid approach models this problem as a pipeline of separate stages: each stage introduces its own intermediate label sequence, and a dedicated component is optimized to model the corresponding probabilities. In contrast, end-to-end models have recently become increasingly popular. Sequence-to-sequence models, a type of end-to-end model, take the input sequence and predict the target sequence directly, which is more intuitive. They usually predict one label at a time and are widely known as autoregressive models. Autoregressive models are easy to train and have a clean theoretical justification in the probability chain rule. However, this simplicity makes inference inefficient, which becomes a severe problem in practice when the output sequence is long, such as a sequence of characters. To speed up inference, researchers have turned to another type of sequence-to-sequence model, known as non-autoregressive models. In contrast to autoregressive models, non-autoregressive models predict the whole sequence within a constant number of iterations. However, non-autoregressive models are more challenging to train than autoregressive models. In this dissertation, we propose two types of non-autoregressive models for speech applications: a mask-based approach and a noise-based approach. To demonstrate the effectiveness of the two proposed methods, we explore their use in two essential tasks: speech recognition and speech synthesis.
The two novel directions proposed in this dissertation provide a good tradeoff between performance and decoding speed and are important for the non-autoregressive speech research field. They allow researchers to apply larger, more sophisticated networks in their research, and companies can use these methods to provide better-quality services under time and budget constraints. Some of the methods in this dissertation are not limited to speech applications and may facilitate neural network research in other fields, such as neural machine translation or image captioning.
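
The decoding-cost contrast described in the abstract can be illustrated with a small toy sketch. It is not the dissertation's method: the "models" below (`toy_next_label`, `toy_fill_all`) are hypothetical stand-ins that simply read from a fixed target string, the confidence scores are made up, and the linear re-masking schedule is an assumption borrowed from generic mask-predict style decoding. The only point is to count model calls: one per output label in the autoregressive case, versus a fixed number of iterations in the mask-based case.

```python
MASK = "_"
TARGET = list("hello world")  # toy reference output; purely illustrative


def toy_next_label(prefix):
    """Hypothetical autoregressive step: return the next label given the prefix."""
    return TARGET[len(prefix)]


def toy_fill_all(tokens):
    """Hypothetical parallel step: predict every position at once.

    Returns (predictions, confidences). Positions that were already
    unmasked get confidence 1.0; masked ones get a lower made-up score.
    """
    preds = list(TARGET)
    confs = [1.0 if tok != MASK else 0.5 + 0.01 * i for i, tok in enumerate(tokens)]
    return preds, confs


def autoregressive_decode(length):
    """Emit one label per model call, so the call count grows with the length."""
    out, calls = [], 0
    for _ in range(length):
        out.append(toy_next_label(out))
        calls += 1
    return out, calls


def mask_based_decode(length, iterations):
    """Start fully masked, predict all positions, re-mask the least confident ones."""
    tokens, calls = [MASK] * length, 0
    for it in range(iterations):
        preds, confs = toy_fill_all(tokens)
        calls += 1
        tokens = list(preds)
        # Assumed schedule: the number of re-masked positions decays linearly to zero.
        n_mask = length * (iterations - 1 - it) // iterations
        if n_mask == 0:
            break
        lowest = sorted(range(length), key=lambda i: confs[i])[:n_mask]
        for i in lowest:
            tokens[i] = MASK
    return tokens, calls


ar_out, ar_calls = autoregressive_decode(len(TARGET))
nar_out, nar_calls = mask_based_decode(len(TARGET), iterations=3)
print("autoregressive model calls:", ar_calls)
print("mask-based model calls:", nar_calls)
```

In this sketch the autoregressive loop needs as many calls as there are output labels, while the mask-based loop never exceeds the chosen `iterations` regardless of output length. In a real system each call is a neural network forward pass, which is where the speed difference would come from.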